
feat: migrate vault to Go binary runevault with multi-CSP installer#68

Open
jh-lee-cryptolab wants to merge 26 commits into main from epic/go-migration

Conversation


@jh-lee-cryptolab jh-lee-cryptolab commented Apr 30, 2026

Closes #61 (Phase 1 — Go runtime migration), #63 (Phase 2 — multi-platform release artifacts), #64 (Phase 3 — Docker-free installer).

Context

Vault was a Python+Docker stack distributed via GHCR; operations were fragile and required runtime dependencies on every host. This branch ships a Go binary with a one-command installer for local and AWS/GCP/OCI deploys.

TL;DR

Rewrite vault as a single-binary Go daemon, runevault, with a one-command installer for local and AWS/GCP/OCI deploys.

Summary

Alternatives

  • Keep Python and add a Go shim — rejected: Docker-on-VM was the primary ops pain, and a static binary removes the entire runtime layer.
  • Preserve env-var fallback for compatibility — rejected per the #61 decision (Epic: Phase 1 — Go runtime migration + unified binary): ships as a clean breaking change with no migration helper.
  • Move CSP provisioning into a Go installer-cli — rejected: shell + terraform is closer to what cloud admins already read and audit.
  • Sigstore keyless signing for the release — initially included, then dropped: cosign-based verification was deemed operationally heavy. SHA256SUMS + GitHub HTTPS is the trust anchor.

Test plan

  • mise run check passes (gofmt + go vet + unit tests with race)
  • mise run go:build produces vault/bin/runevault
  • mise run go:test:e2e passes against the built binary
  • Local install: sudo bash install.sh --target local succeeds; runevault status returns SERVING
  • Local uninstall: sudo bash install.sh --uninstall --target local cleanly removes service and files
  • CSP install (one of aws/gcp/oci) provisions a VM, bootstraps it, and exposes gRPC on :50051
  • CSP uninstall (install.sh --uninstall --target <csp> --install-dir ...) runs terraform destroy
  • runevault token issue|rotate|revoke|list work through the admin socket
  • Audit log writes to /opt/runevault/logs/audit.log
  • sha256sum --check --ignore-missing SHA256SUMS passes against the release artifacts
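The final checksum step can be exercised locally; a minimal sketch with a stand-in artifact (the real SHA256SUMS ships with the GitHub release, and the filename here is illustrative):

```shell
# Demonstrate the SHA256SUMS check from the test plan on a stand-in
# artifact. --ignore-missing skips entries whose files are not present
# locally, so verifying a partial download still passes cleanly.
printf 'stand-in release artifact' > runevault_demo.tar.gz
sha256sum runevault_demo.tar.gz > SHA256SUMS
sha256sum --check --ignore-missing SHA256SUMS   # prints "runevault_demo.tar.gz: OK"
```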

jh-lee-cryptolab and others added 21 commits April 27, 2026 12:02
Wraps tail (macOS) or journalctl (Linux) with optional -f flag.
Log path is derived from config source on macOS; Linux delegates to journald.
- Add --target <local|aws|gcp|oci> flag and interactive target menu
- Add --install-dir flag for CSP install directory override
- Add CSP dispatch functions: resolve_target, csp_preflight,
  csp_prompt_config, csp_generate_ssh_key, csp_copy_terraform_files,
  csp_render_tfvars, csp_run_terraform, csp_post_deploy, csp_summary
- Add RUNEVAULT_TLS_HOSTNAME support in generate_tls_certs() as DNS SAN
- Remove team_secret from operator→VM flow; VM auto-generates it
- Remove RUNEVAULT_TEAM_SECRET env var from user-facing interface
- Rewrite deployment/{aws,gcp,oci} cloud-init/startup-script files to
  use Go-native install.sh instead of Docker compose
- Add runevault_version variable to deployment/{aws,gcp,oci}/main.tf
- Remove team_secret variable and output from all three main.tf files
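The VM-side auto-generation described above might look like the following sketch; `generate_team_secret`, the path, and the openssl command are assumptions for illustration, not the installer's actual code:

```shell
# Hypothetical helper: generate the team secret on the VM itself so it
# never crosses the operator->VM boundary. Size and format are
# illustrative assumptions.
generate_team_secret() {
  local secret_file="$1"
  if [ ! -s "$secret_file" ]; then             # keep any existing secret
    ( umask 077; openssl rand -hex 32 > "$secret_file" )
  fi
}

# On the VM this would be invoked as e.g.:
# generate_team_secret /opt/runevault/team_secret
```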

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove separate vault_index_name and tls_hostname variables — team_name
now serves as both the cloud resource name and the vault index passed to
the VM via RUNEVAULT_TEAM_NAME. Fixes runtime crash after region prompt
caused by VAULT_INDEX_NAME validation on an unset variable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
csp_prompt_config ended with [[ "$csp" = oci ]] && { ... }, so on AWS
the function's last command returned non-zero. With set -e, the calling
csp_prompt_config invocation in csp_dispatch then killed the script
silently right after the AWS region prompt. Rewrite the GCP/OCI checks
as if-statements; apply the same fix to setup_system.

Also slim csp_preflight to terraform-only with a y/N auto-install prompt
mirroring local preflight, and add terraform install support to
_install_tool (brew on macOS, HashiCorp zip on Linux).
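The failure mode fixed here can be reproduced in isolation. A minimal sketch (function names are illustrative, not the installer's):

```shell
#!/usr/bin/env bash
set -e
csp=aws

broken_prompt() {
  echo "region prompt"
  # When csp != oci, this test is the function's last command: it
  # returns 1, so the function returns 1, and under set -e the caller
  # dies silently right after the prompt.
  [[ "$csp" = oci ]] && { echo "oci-only prompt"; }
}

fixed_prompt() {
  echo "region prompt"
  # An if-statement with a false condition exits 0, so the function's
  # status stays clean under set -e.
  if [[ "$csp" = oci ]]; then
    echo "oci-only prompt"
  fi
}

fixed_prompt
echo "survived fixed_prompt"

if ( broken_prompt ); then
  echo "survived broken_prompt"
else
  echo "broken_prompt returned non-zero"   # this branch runs for csp=aws
fi
```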

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously csp_post_deploy waited for port 50051 (10 min) plus a fixed
30s sleep, then made 6 short SCP attempts. If the VM-side install was
slow or the cert hadn't been generated yet, SCP failed and the script
fell back to a useless "retry the same SCP" warning.

Replace the port wait + fixed sleep with a single SCP polling loop that
retries every 15s for up to 30 min. SCP succeeds the moment the VM has
generated /opt/runevault/certs/ca.pem, which is the precise signal we
care about. On timeout, die with a pointer to the VM-side install log.

Also pick ssh_user from $csp instead of trying ubuntu/opc in turn.
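The retry-until-deadline shape generalizes; a minimal sketch (`poll_until` is a hypothetical helper name, not the installer's):

```shell
# Retry a command at a fixed interval until it succeeds or a deadline
# passes. In the installer, the command is the SCP of
# /opt/runevault/certs/ca.pem and SCP success is the readiness signal.
poll_until() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    "$@" && return 0
    sleep "$interval"
  done
  return 1   # caller should point the operator at the VM-side install log
}

# Installer-shaped usage (30 min deadline, 15 s interval):
# poll_until 1800 15 scp "ubuntu@$vm_ip:/opt/runevault/certs/ca.pem" .
```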

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The VM-side install.sh runs as root via cloud-init, so SUDO_USER is
empty and _add_invoking_user_to_group is a no-op — the canonical SSH
user (ubuntu on all three CSPs) ends up outside the runevault group
and can't reach /opt/runevault/admin.sock or /opt/runevault/certs.

Add an explicit "usermod -aG runevault ubuntu" right after install.sh
in each cloud-init / startup-script. Drop the auto-detect fallback in
install.sh now that cloud-init owns this responsibility for cloud
deploys; local installs still pick up SUDO_USER as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… summary

team_secret no longer surfaces in operator-facing output (auto-generated
on the VM, not relevant to share). Replace that section with a Next
steps block that SSHes into the VM and runs runevault commands there,
mirroring the local install's Next steps. Same block for all three
CSPs since AWS/GCP/OCI all use Ubuntu 22.04 + systemd.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Terraform's default credential chain is satisfied by the cloud CLI's
auth artifacts in practice (~/.aws/credentials, gcloud ADC file,
~/.oci/config), so verifying the CLI is installed and authenticated
catches the most common "terraform apply silently fails" cause early.

Run "<cli> <non-destructive-auth-call>" as $SUDO_USER (the user that
csp_run_terraform will run terraform under) so the check matches the
actual credential resolution path:
  - aws sts get-caller-identity
  - gcloud auth application-default print-access-token
  - oci iam region list

Also point both local install and CSP summary to "runevault logs" for
the View logs hint, replacing the per-OS journalctl/tail snippets now
that the CLI provides a unified entrypoint.
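A sketch of that preflight shape (`check_csp_auth` is a hypothetical name; the probe commands are the ones listed above):

```shell
# Run a read-only auth call as the user terraform will run under, so the
# check exercises the same credential resolution path
# (~/.aws/credentials, gcloud ADC file, ~/.oci/config).
check_csp_auth() {
  local csp="$1" run_as="${SUDO_USER:-$USER}"
  local -a probe
  case "$csp" in
    aws) probe=(aws sts get-caller-identity) ;;
    gcp) probe=(gcloud auth application-default print-access-token) ;;
    oci) probe=(oci iam region list) ;;
    *)   echo "unknown csp: $csp" >&2; return 2 ;;
  esac
  if ! sudo -u "$run_as" "${probe[@]}" >/dev/null 2>&1; then
    echo "$csp CLI is not authenticated; run its login flow first" >&2
    return 1
  fi
}
```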

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update AMI/image filters across all three CSPs:
- AWS: ubuntu-noble-24.04 on hvm-ssd-gp3
- GCP: ubuntu-2404-lts-amd64
- OCI: Canonical-Ubuntu-24.04-* filtered by VM.Standard.E5.Flex compatibility

The OCI bump also fixes the launch failure on ap-seoul-1 where the
22.04 image wasn't compatible with E5.Flex shape.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCI launches the Canonical Ubuntu cloud image, whose default SSH user
is 'ubuntu' (the 'opc' default belongs to Oracle Linux images, which
we don't deploy). The csp_post_deploy SCP polling loop was hard-coded
to opc for OCI, so it would loop forever connecting as a non-existent
user. Use 'ubuntu' for all three CSPs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_uninstall now dispatches by --target. Local keeps the existing
service + files + data flow; CSP targets call the new csp_uninstall,
which runs terraform destroy against the install dir's terraform.tfstate
and optionally removes the directory afterwards. Operators no longer
have to manually cd into the install dir to tear down cloud
infrastructure — --uninstall --target aws|gcp|oci is enough.

Also switches the interactive target prompt label between "installation"
and "uninstall" to match the active flow, and reorders main so target
is resolved before the uninstall dispatch.
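The CSP branch of that dispatch reduces to a small wrapper; a sketch following the commit's description (error text is illustrative):

```shell
# Tear down cloud infrastructure from the state recorded in the install
# directory; refuse to run when there is no terraform.tfstate to destroy.
csp_uninstall() {
  local install_dir="$1"
  if [ ! -f "$install_dir/terraform.tfstate" ]; then
    echo "no terraform.tfstate in $install_dir; nothing to destroy" >&2
    return 1
  fi
  ( cd "$install_dir" && terraform destroy -auto-approve )
}
```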

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- install-dev.sh prompts for enVector endpoint and API key on
  interactive local installs (previously silently used placeholder
  defaults, producing a non-functional vault on every dev local run).
- install-dev.sh forwards --uninstall to install.sh for both local and
  CSP targets, skipping the dev preflight/build path. The CSP variant
  reuses install.sh's new csp_uninstall (terraform destroy) wrapper.
- Add dev cloud-init / startup-script variants that only install
  prereqs (cosign + apt packages); install.sh + the locally built
  binary are SCP'd in by install-dev.sh after cloud-init finishes.
  AWS variant escapes \${carch} as \$\${carch} so terraform's
  templatefile() leaves the shell expansion intact.
- Switch resolve_target label between "install" and "uninstall" to
  match the active flow.
- Default team_name changed from "dev-team" to "devteam" because
  vault index names cannot contain hyphens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Realign README, CONTRIBUTING, ARCHITECTURE, AGENTS, and tests/FIXTURES
with the actual state of epic/go-migration: single-binary `runevault`
daemon, native systemd/launchd service, admin Unix domain socket,
YAML-only config (`runevault.conf`), install.sh with --target
local|aws|gcp|oci, scripts/install-dev.sh sibling, Sigstore-signed
release pipeline, and Ubuntu 24.04 cloud images. Drop stale references
to Python modules, Docker Compose / GHCR, port-8081 HTTP admin,
env-var fallback, and aspirational HA / restore-from-backup terraform
variables that don't exist. Add an [Unreleased] CHANGELOG section
capturing the BREAKING Go rewrite plus the installer, CSP, and release
pipeline work landed on this branch.
@jh-lee-cryptolab jh-lee-cryptolab self-assigned this Apr 30, 2026
@jh-lee-cryptolab jh-lee-cryptolab added enhancement New feature or request vault Rune-Vault related ci CI/CD pipeline epic Large work item spanning multiple PRs / tasks labels Apr 30, 2026
Per prior decision that cosign-based hashsum verification was operationally
heavy, remove every cosign/Sigstore dependency:

- install.sh: drop cosign from preflight tools, _install_tool case, and
  Phase 2/3 verify-blob block. Stop downloading SHA256SUMS.{sig,pem}.
  RUNEVAULT_SKIP_VERIFY now toggles the SHA256SUMS check.
- release.yaml: drop sigstore/cosign-installer step and cosign sign-blob
  step. Release artifacts are now <archive> + SHA256SUMS only.
- deployment/{aws,gcp,oci}/cloud-init.yaml + startup-script.sh: drop the
  cosign download (no longer needed since install.sh doesn't use it).
- deployment/{aws,gcp,oci}/cloud-init-dev.yaml + startup-script-dev.sh:
  replace the cosign-as-sentinel pattern with a plain
  /var/run/runevault-dev-ready file. install-dev.sh polls the new sentinel.
- install-dev.sh: drop RUNEVAULT_SKIP_VERIFY=1 (LOCAL_BINARY already
  short-circuits download_and_verify so verification never runs in dev).
- README, CONTRIBUTING, CHANGELOG, ARCHITECTURE: replace
  Sigstore/signature-verification language with checksum verification.

SHA256SUMS integrity now relies on GitHub HTTPS for the release-page download.
@jh-lee-cryptolab

https://github.com/CryptoLabInc/rune-admin/releases/tag/v0.4.0-beta.1

You can test installation from the pre-release above.

jh-lee-cryptolab and others added 2 commits April 30, 2026 14:38
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
checkSecretMode was warning-only, which let runevault.conf, api_key_file,
and team_secret_file slip through with world-readable bits set. Convert
the helper to return an error and propagate it through LoadConfig and
readSecretFile so the daemon refuses to start when any secret file is
looser than 0640. Tests cover both the main config and the
team_secret_file indirection paths.
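The "looser than 0640" rule is easy to state in shell terms, though the real check lives in the Go LoadConfig path; a sketch (the octal-mask reasoning is mine, GNU stat assumed):

```shell
# Refuse any secret file whose mode grants group-write or any world
# access: 0037 masks exactly those bits, so (mode & 0037) != 0 means
# the file is looser than 0640.
check_secret_mode() {
  local f="$1" mode
  mode=$(stat -c '%a' "$f")            # e.g. "640"; GNU stat (Linux)
  if (( 8#$mode & 8#0037 )); then
    echo "refusing to start: $f is mode $mode (need 0640 or tighter)" >&2
    return 1
  fi
}
```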

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jh-lee-cryptolab and others added 2 commits April 30, 2026 15:02
DecryptScores and DecryptMetadata previously returned the in-band
.Error field but a nil gRPC status on six paths (base64 decode, FHE
decrypt, JSON envelope decode, DEK derivation, metadata decrypt, and
the missing-team-secret guard). Clients that key on standard gRPC
codes silently missed those failures. Map each path to InvalidArgument
or Internal so the wire-level status is consistent with the rest of
the handler set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Daemon lifecycle belongs to the OS service manager (systemd / launchd),
not to the admin socket. The endpoints duplicated systemctl and
launchctl, and the runevault-group permission model meant any group
member could trigger a process kill — too broad for a control plane.

Remove POST /shutdown and POST /restart from buildAdminMux, drop the
matching onShutdown plumbing in AdminFromConfig, delete
Vault.RequestRestart / RestartRequested, ErrRestartRequested, and the
restart-aware exit branch in main. Operators stop and restart via
systemctl / launchctl, which the install scripts already document.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jh-lee-cryptolab jh-lee-cryptolab marked this pull request as ready for review April 30, 2026 07:40
